I get asked this question a lot as a data scientist working with clinicians and medical students. You, as a clinician, are a busy person, but you are curious and want to contribute to the future of technology in healthcare. You may have heard of Python and R as the de facto standard languages for data science. However, like most people entering data science across the world, you are probably wondering which language you should use.
Here is a summary of my thoughts on this whole debate. For reference, I’ve been programming in R and Python for about 5 years now, specifically within the realm of data science in health care. Before I get to the actual content, I want to say the following:
That being said, there are differences between the two languages and those differences may contribute to which you choose. If you are lazy and don’t want to read the whole article:
Also, there is an infographic from DataCamp that shows some of the differences between R and Python here.
However, I don’t agree with some of the points made in the infographic, so take it with a grain of salt.
What I plan to do over the course of this piece is to highlight the differences between R and Python and give my opinion about which language performs which function better. Here’s a rough outline:
It might seem strange to begin with a discussion of add-ons and the community behind the two languages before talking about the languages themselves. However, the community behind a programming language is actually one of the most important indicators of its success and general usability. Add-ons, known as libraries or packages, can turn languages that are frustrating to program in into languages that are a pleasure to write. My thoughts about the ecosystems of the two languages can be summarized quite succinctly:
R’s ecosystem is very user-friendly and is specialized towards data manipulation, visualization, and statistical inference. In addition, R is primarily a data science language first and foremost, meaning most if not all of the packages are designed to fit within a data science framework.
Python’s data science ecosystem is also quite well developed. However, its packages are sometimes not as easy to use as R’s. That being said, the trade-off is that Python is a much more flexible language than R. Because Python’s origins are as a general-purpose language, it offers some functionality that is harder to replicate in R.
As a quick note, because this is something that comes up a lot: if you are looking to implement custom machine learning models, Python is the better environment to do so. Most of the cutting-edge deep learning research is conducted in Python. R has more packages geared towards complicated statistical techniques, such as those that arise in genetics research. This actually reflects the main difference between the users of the two languages. Traditionally, academics and researchers have preferred R because of all the statistical packages that are available. People coming from software engineering backgrounds tend to prefer Python because of its flexibility as a general-purpose programming language.
With that, let’s dive into examples of functionality in both languages so you can see for yourself which you prefer:
Obviously, one of the things that you need to be able to do in either language is to work with the data itself. Let’s start by reading in some data!
diabetes_data <- read.csv("./dataset_diabetes/diabetic_data.csv")
head(diabetes_data)
## encounter_id patient_nbr race gender age weight
## 1 2278392 8222157 Caucasian Female [0-10) ?
## 2 149190 55629189 Caucasian Female [10-20) ?
## 3 64410 86047875 AfricanAmerican Female [20-30) ?
## 4 500364 82442376 Caucasian Male [30-40) ?
## 5 16680 42519267 Caucasian Male [40-50) ?
## 6 35754 82637451 Caucasian Male [50-60) ?
## admission_type_id discharge_disposition_id admission_source_id
## 1 6 25 1
## 2 1 1 7
## 3 1 1 7
## 4 1 1 7
## 5 1 1 7
## 6 2 1 2
## time_in_hospital payer_code medical_specialty num_lab_procedures
## 1 1 ? Pediatrics-Endocrinology 41
## 2 3 ? ? 59
## 3 2 ? ? 11
## 4 2 ? ? 44
## 5 1 ? ? 51
## 6 3 ? ? 31
## num_procedures num_medications number_outpatient number_emergency
## 1 0 1 0 0
## 2 0 18 0 0
## 3 5 13 2 0
## 4 1 16 0 0
## 5 0 8 0 0
## 6 6 16 0 0
## number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1 0 250.83 ? ? 1 None
## 2 0 276 250.01 255 9 None
## 3 1 648 250 V27 6 None
## 4 0 8 250.43 403 7 None
## 5 0 197 157 250 5 None
## 6 0 414 411 250 9 None
## A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1 None No No No No No
## 2 None No No No No No
## 3 None No No No No No
## 4 None No No No No No
## 5 None No No No No No
## 6 None No No No No No
## acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1 No No No No No No
## 2 No No No No No No
## 3 No Steady No No No No
## 4 No No No No No No
## 5 No Steady No No No No
## 6 No No No No No No
## acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1 No No No No No No No
## 2 No No No No No No Up
## 3 No No No No No No No
## 4 No No No No No No Up
## 5 No No No No No No Steady
## 6 No No No No No No Steady
## glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1 No No No
## 2 No No No
## 3 No No No
## 4 No No No
## 5 No No No
## 6 No No No
## metformin.rosiglitazone metformin.pioglitazone change diabetesMed
## 1 No No No No
## 2 No No Ch Yes
## 3 No No No Yes
## 4 No No Ch Yes
## 5 No No Ch Yes
## 6 No No No Yes
## readmitted
## 1 NO
## 2 >30
## 3 NO
## 4 NO
## 5 NO
## 6 >30
We’ll use this dataset to illustrate some of the differences between using the base language and using packages to help us. Let’s say we want to find which combinations of race and gender have more medications on average and sort them according to that number. In the base R language, we could do that like this:
# Base language
result <- aggregate(diabetes_data$num_medications, by=list(diabetes_data$race, diabetes_data$gender), FUN=mean)
result <- result[order(result$x, decreasing = TRUE),]
head(result)
## Group.1 Group.2 x
## 14 Other Unknown/Invalid 22.00000
## 4 Caucasian Female 16.46434
## 10 Caucasian Male 16.09105
## 7 ? Male 15.88576
## 1 ? Female 15.74492
## 2 AfricanAmerican Female 15.63446
Some of the syntax can seem foreign at first glance. Luckily, the dplyr package makes this operation trivial and, more importantly, readable:
library(dplyr) # Import the dplyr library
result <- diabetes_data %>%
group_by(race, gender) %>%
summarize(average_medications = mean(num_medications)) %>%
arrange(desc(average_medications))
head(result)
## # A tibble: 6 x 3
## # Groups: race [4]
## race gender average_medications
## <fct> <fct> <dbl>
## 1 Other Unknown/Invalid 22
## 2 Caucasian Female 16.5
## 3 Caucasian Male 16.1
## 4 ? Male 15.9
## 5 ? Female 15.7
## 6 AfricanAmerican Female 15.6
In general, many of the packages that are available in R are user-friendly and powerful. The top packages are extremely well-designed in large part due to Hadley Wickham, who is a famous contributor to the R ecosystem and has really ensured that it has cemented its place as a core data science language.
filtering_example <- diabetes_data %>%
filter(gender == "Female") %>%
filter(race == "Caucasian") %>%
filter(num_medications > 8)
head(filtering_example)
## encounter_id patient_nbr race gender age weight
## 1 149190 55629189 Caucasian Female [10-20) ?
## 2 12522 48330783 Caucasian Female [80-90) ?
## 3 15738 63555939 Caucasian Female [90-100) ?
## 4 40926 85504905 Caucasian Female [40-50) ?
## 5 84222 108662661 Caucasian Female [50-60) ?
## 6 183930 107400762 Caucasian Female [80-90) ?
## admission_type_id discharge_disposition_id admission_source_id
## 1 1 1 7
## 2 2 1 4
## 3 3 3 4
## 4 1 3 7
## 5 1 1 7
## 6 2 6 1
## time_in_hospital payer_code medical_specialty num_lab_procedures
## 1 3 ? ? 59
## 2 13 ? ? 68
## 3 12 ? InternalMedicine 33
## 4 7 ? Family/GeneralPractice 60
## 5 3 ? Cardiology 29
## 6 11 ? ? 42
## num_procedures num_medications number_outpatient number_emergency
## 1 0 18 0 0
## 2 2 28 0 0
## 3 3 18 0 0
## 4 0 15 0 1
## 5 0 11 0 0
## 6 2 19 0 0
## number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1 0 276 250.01 255 9 None
## 2 0 398 427 38 8 None
## 3 0 434 198 486 8 None
## 4 0 428 250.43 250.6 8 None
## 5 0 682 174 250 3 None
## 6 0 V57 715 V43 8 None
## A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1 None No No No No No
## 2 None No No No No No
## 3 None No No No No No
## 4 None Steady Up No No No
## 5 None No No No No No
## 6 None No No No No No
## acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1 No No No No No No
## 2 No Steady No No No No
## 3 No No No No No Steady
## 4 No No No No No No
## 5 No No Steady No No No
## 6 No No No No No No
## acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1 No No No No No No Up
## 2 No No No No No No Steady
## 3 No No No No No No Steady
## 4 No No No No No No Down
## 5 No No No No No No No
## 6 No No No No No No No
## glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1 No No No
## 2 No No No
## 3 No No No
## 4 No No No
## 5 No No No
## 6 No No No
## metformin.rosiglitazone metformin.pioglitazone change diabetesMed
## 1 No No Ch Yes
## 2 No No Ch Yes
## 3 No No Ch Yes
## 4 No No Ch Yes
## 5 No No No Yes
## 6 No No No No
## readmitted
## 1 >30
## 2 NO
## 3 NO
## 4 <30
## 5 NO
## 6 >30
In Python, the de facto standard package for working with data is known as pandas. (It is a third-party package, not part of Python’s built-in standard library.)
import pandas as pd
diabetes_data = pd.read_csv("./dataset_diabetes/diabetic_data.csv")
print(diabetes_data.head())
## encounter_id patient_nbr ... diabetesMed readmitted
## 0 2278392 8222157 ... No NO
## 1 149190 55629189 ... Yes >30
## 2 64410 86047875 ... Yes NO
## 3 500364 82442376 ... Yes NO
## 4 16680 42519267 ... Yes NO
##
## [5 rows x 50 columns]
diabetes_python = (diabetes_data
.groupby(['race', 'gender'], as_index = False)['num_medications']
.mean()
.sort_values(by = ['num_medications'], ascending = False))
print(diabetes_python.head(6))
## race gender num_medications
## 13 Other Unknown/Invalid 22.000000
## 7 Caucasian Female 16.464335
## 8 Caucasian Male 16.091046
## 1 ? Male 15.885764
## 0 ? Female 15.744925
## 3 AfricanAmerican Female 15.634465
pandas is a very powerful but complex package. There are a lot of parameters to think about because of how flexible the package is. For example, things like the as_index = False argument above can be hard to find if you don’t know what you’re looking for. To find that particular case, you would have to look at the documentation for the groupby method, which is located here. I would recommend that you take a look to see how many ways you can modify this method with parameters. You can compare that to dplyr’s reference here.
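To make the effect of `as_index` more concrete, here is a minimal sketch on a tiny made-up table (the values are invented for illustration and are not taken from the diabetes dataset). With the default `as_index=True`, the grouping columns become the index of the result; with `as_index=False`, they stay as ordinary columns, which is closer to what dplyr gives you.

```python
import pandas as pd

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "race": ["A", "A", "B", "B"],
    "gender": ["Female", "Male", "Female", "Female"],
    "num_medications": [10, 20, 30, 50],
})

# Default behaviour: the group keys become a (multi-level) index
indexed = toy.groupby(["race", "gender"])["num_medications"].mean()

# as_index=False: the group keys remain regular columns
flat = toy.groupby(["race", "gender"], as_index=False)["num_medications"].mean()

print(type(indexed).__name__)  # Series
print(list(flat.columns))      # ['race', 'gender', 'num_medications']
```

The second form is usually easier to sort, filter, and plot afterwards, which is probably why it shows up so often in pandas code.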
filtering_example = (diabetes_data[
(diabetes_data['gender'] == "Female") &
(diabetes_data['race'] == "Caucasian") &
(diabetes_data['num_medications'] > 8)])
print(filtering_example.head())
## encounter_id patient_nbr ... diabetesMed readmitted
## 1 149190 55629189 ... Yes >30
## 8 12522 48330783 ... Yes NO
## 9 15738 63555939 ... Yes NO
## 12 40926 85504905 ... Yes <30
## 17 84222 108662661 ... Yes NO
##
## [5 rows x 50 columns]
From personal experience, I would say that R, thanks to tools like dplyr and tidyr, is a friendlier language for data manipulation, and it is generally easier to figure out what you are doing. In addition, the pandas documentation is huge and can be intimidating to new users. As a comparison, there are about 10 main functions that you need to learn in dplyr, whereas pandas contains hundreds of functions that all serve different purposes.
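For readers coming from dplyr, one way to get a little closer to the `filter()` style in pandas is `DataFrame.query`, which takes the condition as a string. The sketch below compares it with the boolean-mask style used earlier in this article; the toy table is invented for illustration and is not the real dataset.

```python
import pandas as pd

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female"],
    "race": ["Caucasian", "Caucasian", "Caucasian", "Other"],
    "num_medications": [18, 5, 12, 20],
})

# Boolean-mask style: each condition is wrapped in parentheses and joined with &
masked = toy[
    (toy["gender"] == "Female")
    & (toy["race"] == "Caucasian")
    & (toy["num_medications"] > 8)
]

# Query-string style: the same three conditions written as one expression
queried = toy.query(
    "gender == 'Female' and race == 'Caucasian' and num_medications > 8"
)

print(queried)
print(masked.equals(queried))  # True: both approaches select the same rows
```

Which style reads better is a matter of taste; the mask style is more common in tutorials, while the query style tends to feel more familiar if you are used to dplyr.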
Both R and Python are great at visualization. I honestly would call it almost a toss-up here. ggplot2 is the standard plotting library in R and it is fantastic. Python’s visualization frameworks are a bit more fragmented. The traditional standard for scientific plotting is known as matplotlib whereas some prettier graphs can be made using seaborn. For that reason, R might have the slight edge, although the difference is minimal.
Let’s try an example in ggplot2!
library(ggplot2)
ggplot(data = diabetes_data) +
geom_boxplot(aes(x = interaction(gender, race), y = num_medications, fill = gender)) +
labs(title = "Test Plot", x = "Gender + Race", y = "Number of Medications")
Ok, the x-ticks are annoying. Fixing it requires a bit of complexity, so I’ll go ahead and illustrate that as well. Oftentimes there are hiccups like this that can cause frustration when dealing with visualization, which happens in both python and R.
library(stringr) # Work with strings! (characters)
# Full disclosure -- I found the solution at stackoverflow:
# https://stackoverflow.com/questions/50047331/only-show-one-part-of-interacting-x-variable-in-x-axis-labels-in-ggplot
# Stack Overflow is the programming bible -- if you know how to properly phrase your question, chances are that someone has
# already answered it. Think of it like UpToDate but for programmers
make_labels <- function(labels) {
result <- str_split(labels, "\\.")
unlist(lapply(result, function(x) x[2]))
}
# This function takes in the 'labels' from the graph
# Then, it splits them at the period and takes the 2nd part, which is what we want here.
# the "\\." is what is known as a regular expression, which is a great, but seemingly complex, tool for parsing strings
# For more on regular expressions, you can read about them here: https://www.regular-expressions.info
diabetes_viz <- ggplot(data = diabetes_data) +
geom_boxplot(aes(x = interaction(gender, race), y = num_medications, fill = gender)) +
labs(title = "Test Plot", y = "Number of Medications") + scale_x_discrete(labels = make_labels, name = "Race") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
diabetes_viz
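The same label clean-up idea carries over to Python. The sketch below only illustrates the regular expression involved: `make_labels` here is a hypothetical Python counterpart of the R helper above, and the example labels are written out by hand to mimic what `interaction(gender, race)` produces.

```python
import re

def make_labels(labels):
    """Keep the second period-separated piece of each label."""
    # r"\." matches a literal period; an unescaped "." would match any character
    return [re.split(r"\.", label)[1] for label in labels]

# Hand-written labels in the "gender.race" form
labels = ["Female.Caucasian", "Male.Caucasian", "Female.AfricanAmerican"]
print(make_labels(labels))
# ['Caucasian', 'Caucasian', 'AfricanAmerican']
```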
In python, the main plotting libraries are matplotlib and seaborn. Matplotlib is generally used for quick graphs to visualize some data that you aren’t going to present, since it doesn’t generally look as nice. Although matplotlib can be used to visualize data that does not come from a Pandas DataFrame, Pandas actually has some nice built-in tools to help the process.
import matplotlib.pyplot as plt
# Plot the full dataset; diabetes_python only holds one mean per group
diabetes_data.boxplot(['num_medications'], by = ['race', 'gender'])
As you can see, however, this doesn’t really look good. Although matplotlib allows you to customize virtually every aspect of this figure, it can often be easier to just use seaborn, which is a way to create prettier visualizations:
import seaborn as sns
# Again use the full dataset so each box summarizes a real distribution
sns_example = sns.boxplot(x = 'race', y = 'num_medications', hue = 'gender', data = diabetes_data)
I made the figure bigger this time so the x-ticks wouldn’t run into each other, but there are custom ways of dealing with this in python as well.
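One common way to handle crowded x-tick labels in matplotlib is to enlarge the figure and rotate the labels, much like the `theme(axis.text.x = ...)` call in the ggplot2 example. This is a rough sketch rather than the code behind the figure above: the numbers are invented, and the output file name is just a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Invented values, one list of medication counts per group
groups = {
    "Caucasian / Female": [18, 28, 15, 11],
    "Caucasian / Male": [16, 8, 16, 21],
    "AfricanAmerican / Female": [13, 9, 17, 14],
}

# A wider figure gives the labels more room
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot(list(groups.values()))

# Boxes are drawn at positions 1..N, so the ticks are placed there
ax.set_xticks(range(1, len(groups) + 1))
# Rotating the labels keeps them from running into each other
ax.set_xticklabels(list(groups.keys()), rotation=45, ha="right")
ax.set_ylabel("Number of Medications")

fig.tight_layout()  # make sure the rotated labels are not cut off
fig.savefig("boxplot_example.png")  # placeholder file name
```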
Visualization can be tedious in any language just because of the sheer amount of customizability that is almost mandatory to make the figures exactly the way you want. Since visualization is so important, the libraries used to create them are generally very well-written and documented, so there isn’t really a wrong way to go here.
Machine Learning is a hot topic in healthcare and most other industries. Both R and Python offer excellent tools for machine learning. The two main packages that are used for machine learning in R and Python are caret and Scikit-Learn, respectively. Although they differ in their approaches, both are very usable. For most of the common machine learning techniques that were invented prior to 2010, you can probably find an implementation in these two packages. That includes things like,
If you’re dealing with models like this, both Python and R should be completely fine. However, if you want to start playing around with custom models, this is where Python’s ecosystem makes it the better choice. Almost no deep learning work occurs in R, because TensorFlow, which is Google’s neural network library, supports Python a lot better.
That being said, I wouldn’t make a decision on which language to choose simply due to the machine learning frameworks available in either language. If you want to do something with medical images or natural language processing, python is the better choice. Otherwise, they are essentially identical.
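As a rough illustration of what the scikit-learn side looks like for a classic model, here is a minimal sketch that fits a logistic regression. The data is synthetic (generated by scikit-learn itself), so the accuracy it prints says nothing about any clinical problem; the point is only the fit-then-score workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 made-up "patients" with 5 numeric features each
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a quarter of the rows to check the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Most scikit-learn models follow this same `fit` and `score` pattern, so swapping in a different algorithm usually means changing only the line that creates the model.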
R is better for statistics. Period. Most statisticians use R, and if there is a special type of hypothesis test or some other statistical analysis tool, it will most likely be implemented in R before it is implemented in python.
If you are looking to do basic statistical analysis, both languages are fine. As soon as you venture into the world of custom statistical procedures, you’re better off learning R.
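To give a sense of what "basic" means here, the sketch below computes Welch's t statistic for two small groups using only Python's built-in `statistics` module. The numbers are invented, and `welch_t` is a helper written for this example; in practice you would more likely reach for `t.test` in R or `scipy.stats.ttest_ind(..., equal_var=False)` in Python, which also report a p-value.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Invented medication counts for two hypothetical groups
group_a = [12, 15, 14, 10, 13]
group_b = [18, 20, 17, 22, 19]

print(round(welch_t(group_a, group_b), 3))
```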
One underappreciated aspect of programming languages for data science is how to present your work. Here, I think R is the winner simply because of how easy to use it is. R has something known as R Markdown, which this entire article is written in! It allows you to have in-line code like you’ve seen and it has a bunch of bells and whistles for showing your work afterwards. For example, I can include interactive visualizations natively into these documents!
# You can hover here for some cool effects!
library(plotly) # package for interactive visualizations
interactive_df <- ggplotly(diabetes_viz)
interactive_df
Python has its own version of this type of framework, known as Jupyter Notebooks. Actually, Jupyter Notebooks can work with R also. However, they’re not as easy to use, in my opinion. At this point in their development, both R Markdown and Jupyter Notebooks have similar functionality, so it all comes down to ease of use. You can find out more about Jupyter Notebooks here.
What if you want to present a web application of your analysis? Here, I think R’s web-app framework, known as shiny, is great. It allows you to rapidly spin up applications that feature interactive visualizations and are easy to host. Python’s versions of this are much more fragmented, and may require more prerequisite knowledge and more of a programming background to set up. There are general web-app frameworks like Django and Flask, which are definitely not suited to beginning data science programmers. dash, which is made by plotly for Python, offers similar functionality to shiny, but it is nowhere near as robust. You can find really cool examples of modules that you can integrate seamlessly into shiny apps here. Here are some links to examples of shiny apps and dash apps, which I would say is a fair comparison:
R, with its Shiny infrastructure, definitely wins in terms of ease-of-use here.
The two languages differ quite a bit when it comes to installation of packages/libraries. In R, you can install any package you want by simply opening an R console and typing install.packages('<name of your package>'). In Python, you do this from the command line, and there are lots of different ways to install packages. For example, two common package managers are conda and pip. The trade-off here is pretty clear. Python’s package management is more fragmented, but gives you greater control over which versions of packages you have installed. So for example, let’s say you’re using Python version 2.7 and you want pandas version 0.24. That is something that you can easily do in Python. R, however, doesn’t really encourage this as much as Python. That may be due to the fact that R updates rarely break old code (in my experience), and most code written against prior versions of packages continues to function down the line. However, if you are putting some code into production, relying on that is definitely not a viable solution. This whole scenario echoes the general differences between R and Python’s main users: R tends to attract data scientists and researchers doing one-off analyses, whereas Python attracts engineers and people looking to do custom work and put it into production.
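As a small illustration of the version control described above, the sketch below shows how you might check which version of a package is installed from within Python, with the pinning commands themselves shown as comments. `installed_version` is a helper written for this example, and the second package name is deliberately made up to show what happens when a package is missing.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Pinning a specific release is done from the command line, for example:
#   pip install "pandas==0.24.2"
#   conda install pandas=0.24.2
for name in ["pandas", "a-package-that-is-not-installed"]:
    print(name, "->", installed_version(name))
```

Writing the pinned versions down (for example in a requirements file) is what lets someone else, or a production server, recreate the same environment later.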
An IDE is where most people write their data science programs and scripts. For R, the main IDE is RStudio. RStudio is one of the best IDEs in existence. It has everything a modern IDE should have and is custom-tailored for R. For Python, the main IDEs for data science are Jupyter Notebooks and Spyder. There are many others that are probably better suited for software development, but those are the main two for data science in Python. Jupyter Lab, which includes Jupyter Notebooks as a subset, is also getting better in terms of its functionality, but I would say that R wins hands-down in this department.
RStudio can be downloaded here, whereas instructions for installing Jupyter Notebooks can be found here.
Hopefully this has exposed you a little bit to the differences between R and Python and the ways that they are similar. In my opinion, it is difficult to make a wrong choice here. A lot of what you learn transfers directly between the two languages. If you are just getting started, I would suggest just making a decision and sticking with it. If you have any questions or feedback (things you’d like to see added, etc.), feel free to email me at michael.gao@duke.edu